IDSIA-01-05 



16 January 2005 



Master Algorithms for Active Experts Problems 
based on Increasing Loss Values 






Jan Poland 
Marcus Hutter 

IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland 



JAN@IDSIA.CH 
MARCUS@IDSIA.CH 
WWW.IDSIA.CH 



Abstract 

We specify an experts algorithm with the fol- 
lowing characteristics: (a) it uses only feed- 
back from the actions actually chosen (ban- 
dit setup), (b) it can be applied with count- 
ably infinite expert classes, and (c) it copes 
with losses that may grow in time appropri- 
ately slowly. We prove loss bounds against 
an adaptive adversary. From this, we obtain
master algorithms for "active experts problems",
which means that the master's actions
may influence the behavior of the adversary.
Our algorithm can significantly outperform 
standard experts algorithms on such prob- 
lems. Finally, we combine it with a univer- 
sal expert class. This results in a (compu- 
tationally infeasible) universal master algo- 
rithm which performs - in a certain sense - 
almost as well as any computable strategy, 
for any online problem. 

Keywords. Prediction with expert advice, 
responsive environments, partial observation 
game, bandits, universal learning, asymp- 
totic optimality. 



1. Introduction 

Expert algorithms have been popular for about fifteen
years [LW89]. They are appropriate for online
prediction or repeated decision making or repeated
game playing (we call these setups online problems for 
brevity), based on a class of "experts". In each round, 
each expert gives a recommendation. From this, we de- 
rive a master decision. After that, losses (or rewards) 
are assigned to each expert by the environment, also 
called adversary. Our goal is to perform almost as 
well as the best expert in hindsight in the long run. In 
other words, we try to minimize the regret. 

The early papers deal with the full information game, 



where we get to know the losses of each expert af- 
ter each round. The analysis holds for the worst 
case, where the environment is fully adversarial and 
tries to maximize our regret in the long run. Later, 
[ACBFS95] gave a worst-case analysis for the bandit
setup, where the master algorithm knows only the loss
of its own decision after each round. This has been
further generalized to label-efficient prediction [HP97]
and partial monitoring [CBLS04].

Recently, [FM04] introduced a strategic experts algorithm
which performs well for a broader class of environments.
The algorithm still has asymptotically optimal properties
against a worst-case adversary. Additionally, it may perform
much better than a standard experts algorithm in more
favorable situations, when the actions influence the behavior
of the environment. We refer to these as active experts
problems. One example is the repeated prisoner's dilemma when
the opponent is willing to cooperate under certain conditions
(see Section 5 for some details). However, [FM04] give
only asymptotic guarantees, but no convergence rate.

In this paper, we introduce a different algorithm for active
experts problems with the same asymptotic guarantees, but in
addition a convergence rate (of $t^{-1/10}$) is shown. Both
algorithm and analysis are assembled from a standard
"toolkit", based on [KV03, MB04]. The basic idea is the
following: We use the bandit experts algorithm by [MB04], but
allow the losses to increase with time $t$. This allows us to
give control to one expert for an increasing period of time steps.

Secondly, we generalize our analysis to the case of infinitely
many experts, based on [HP04b]. The master algorithm stays
computable (if the experts are), since only a finite (with
time increasing) number of experts is involved. Allowing
infinitely many experts also permits us to define a universal
expert class by means of all programs on some universal Turing
machine. (This construction is quite common in Algorithmic
Information Theory, see e.g. [Hut04].) Thus, we obtain a
universal master algorithm, which we show to perform in a
certain sense almost as well as any computable strategy
on any online problem. Thus, we introduce a new approach
to universal artificial intelligence, which is in a sense dual
to the AIXI model based on Bayesian learning [Hut04]. Although
the master algorithm is computable, the resulting universal
agent is not (like the AIXI model), since the experts may be
non-responsive.

The paper is structured as follows. Section 2 introduces
the problem setup, the notation, and the algorithm.
In Sections 3 and 4 we give the (worst-case)
analysis for finite and infinite expert classes. The
implications for active experts problems and a universal
master algorithm are given in Section 5. Section 6
contains discussion and conclusions.

2. The Algorithm 

Our task is an online decision problem. That is, we
have to make a sequence of decisions, each of which
results in a certain loss we incur. "We" is an abbreviation
for the master algorithm which is to be designed.
For concreteness, you may imagine the task of
playing a game repeatedly. In each round, i.e. at each
time step $t$, we have access to the recommendations of
$n \in \mathbb{N} \cup \{\infty\}$ "experts" or strategies. We do not specify
what exactly a "recommendation" is - we just follow
the advice of one expert. Before we reveal our move,
the adversary has to assign losses $\ell_t^i \ge 0$ to all experts
$i$. There is an upper bound $B_t$ on the maximum loss
the adversary may use, i.e. $\ell_t \in [0,B_t]^n$. This quantity
may depend on $t$ and is known to us. After the move,
only the loss of the selected expert $i$ is revealed. This
is the bandit setup, as opposed to the full information
game where we get to know the losses of all experts. Our
goal is to perform nearly as well as the best available
strategy in terms of cumulative loss, after any number
$T$ of time steps which is not known in advance.
The difference between our loss and the loss of some
expert is also termed regret. We consider the general
case of an adaptive adversary, which may assign losses
depending on our past decisions.

If there is a finite number $n$ of experts or strategies,
then it is common to give no prior preferences to any
of them. Formally, we define prior weights $w^i = 1/n$.
Moreover, we define the complexity of expert $i$ as
$k^i = -\ln w^i$. This arises in the full observation game, where
the regret can be bounded by some function of the
best expert's complexity. On the other hand, if there
are reasons not to trust all strategies equally in the
beginning, we may use a non-uniform prior $w$. This
is mandatory for infinitely many experts. We then
require $w^i > 0$ for all experts $i$ and $\sum_i w^i = 1$.
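The relation between prior weights and complexities can be made concrete with a few lines of Python. This is only an illustrative sketch: the helper names and the geometric prior $w^i = 2^{-i}$ are assumptions chosen for the example, not part of the paper.

```python
import math

def uniform_prior(n):
    """Uniform prior weights w^i = 1/n over n experts."""
    return [1.0 / n] * n

def complexities(weights):
    """Complexity k^i = -ln(w^i) of each expert."""
    return [-math.log(w) for w in weights]

def geometric_prior(m):
    """Hypothetical non-uniform prior w^i = 2^{-i}, truncated to the
    first m experts (the full series sums to one)."""
    return [2.0 ** (-i) for i in range(1, m + 1)]

w = uniform_prior(4)
k = complexities(w)                      # every entry equals ln 4
g = geometric_prior(20)
# sum_i exp(-k^i) is the sum of the weights, hence at most one
kraft_sum = sum(math.exp(-ki) for ki in complexities(g))
```

The quantity `kraft_sum` is the sum $\sum_i e^{-k^i}$ that the later lemmas require to be at most one.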



For $t = 1,2,3,\ldots$
    Sample $r_t \in \{0,1\}$ independently s.t. $P[r_t = 1] = \gamma_t$
    If $r_t = 0$ Then
        Play FPL($t$)'s decision ($I_t^{FoE} := I_t^{FPL}$)
        Set $\hat\ell_t^i = 0$ for all $1 \le i \le n$
    Else
        Sample $I_t^{FoE} \in \{1 \ldots n\}$ uniformly & play $I := I_t^{FoE}$
        Let $\hat\ell_t^I = \ell_t^I n/\gamma_t$ and $\hat\ell_t^i = 0$ for all $i \ne I$

Figure 1. The algorithm FoE



Sample $q_t^i \sim \mathrm{Exp}$ independently for $1 \le i \le n$
select and play $I_t^{FPL} = \arg\min_{1 \le i \le n} \{\eta_t \hat\ell_{<t}^i + k^i - q_t^i\}$

Figure 2. The algorithm FPL($t$)



Our algorithm "Follow or Explore" (FoE) builds on 
McMahan and Blum's online geometric optimization 
algorithm. (For finite n and uniform prior, it even is 
their algorithm, save for the adaptive parameters.) It 
is a bandit version of a "Follow the Perturbed Leader" 
experts algorithm. This approach to online prediction
and playing repeated games has been pioneered
by [Han57]. For the full observation game, [KV03]
gave a very elegant analysis which is distinct from the
standard analysis of exponential weighting schemes.
It is particularly handy if the learning rate is dynamic 
rather than fixed in advance. A dynamic learning rate 
is necessary if there is no target time T known in ad- 
vance. 

The algorithm is composed of two standard ingredi- 
ents: exploration and follow the (perturbed) leader. 
Since we are playing the bandit game (as opposed to 
the full information game), we need to explore suffi- 
ciently. Otherwise, there could be a strategy which 
we think is poor (and thus never play), but in reality 
it is good. At each time step $t$, we decide randomly
according to some exploration rate $\gamma_t \in (0,1)$ whether
to explore or not. If so, we choose an expert according
to the uniform distribution (or the prior distribution,
compare (5), in case of non-uniform priors). After
observing the loss of the selected expert, we want
to give an unbiased estimate of the true loss vector.
We achieve that by dividing the observed loss by the
probability of exploring this expert, and estimate the
unobserved losses of all other experts by zero. We call
the resulting loss vector $\hat\ell_t$.
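That this estimate is unbiased can be checked by enumerating the exploration outcomes. The following sketch assumes the uniform case (exploration probability $\gamma_t/n$ per expert); the function name and the example numbers are illustrative only.

```python
def expected_estimate(true_losses, gamma):
    """Exact expectation of the estimated loss vector in one round.

    With probability 1 - gamma there is no exploration and the estimate
    is the zero vector. With probability gamma / n expert i is explored
    and its estimate is true_losses[i] * n / gamma (all others zero).
    """
    n = len(true_losses)
    expected = [0.0] * n
    for explored in range(n):
        prob = gamma / n
        estimate = true_losses[explored] * n / gamma
        expected[explored] += prob * estimate
    return expected

losses = [0.2, 0.9, 0.5]
avg = expected_estimate(losses, gamma=0.1)   # equals losses up to rounding
```

The price of unbiasedness is the size of the estimates: a single estimated loss can be as large as $B_t n/\gamma_t$.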

When not exploring, we follow some strategy which
performed well in the past. It may not be advisable
to always pick the best strategy so far - the adversary
could fool us in this case. Instead we introduce
a perturbation for each expert and follow the
advice of the strategy with the best perturbed score.
In order to assign a score to each expert, note that
we have only access to the estimated losses $\hat\ell_t$. Let
$\hat\ell^i_{<T} = \sum_{t=1}^{T-1} \hat\ell_t^i$ be the estimated cumulative past loss
of expert $i$. Then his complexity-penalized score is
defined as $\eta_T \hat\ell^i_{<T} + k^i$, i.e. high scores are bad. Here,
$\eta_T > 0$ is the learning rate. The perturbed score is then
given by $\eta_T \hat\ell^i_{<T} + k^i - q^i$, where the perturbations $q^i$ are
chosen independently exponentially distributed. This
ensures a convenient analysis.

The algorithms "Follow or Explore" FoE and "Follow
the Perturbed Leader" FPL are fully specified in
Figures 1 and 2. Note that each time randomness is used,
it is assumed to be independent of the past randomness.
Note also that all algorithms occurring in this
paper work with the estimated losses $\hat\ell$. We may
evaluate their performance in terms of true or estimated
losses, this is specified in the notation. E.g. for the
true loss of FPL up to and including time $T$ we write
$\ell^{FPL}_{1:T} = \sum_{t=1}^T \ell_t^{I_t^{FPL}}$, while the estimated loss is
$\hat\ell^{FPL}_{1:T} = \sum_{t=1}^T \hat\ell_t^{I_t^{FPL}}$.
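The two procedures can be sketched in Python as follows. This is a minimal illustration for the finite, uniform-prior case and not a reference implementation: the rates $\gamma_t = t^{-1/4}$ and $\eta_t = t^{-3/4}$ are the choices used later in the paper, while `loss_fn`, the seed handling, and the toy two-expert loss table in the usage lines are assumptions made for the example.

```python
import math
import random

def fpl_choice(eta, est_cum, k, rng):
    """FPL(t): argmin_i of eta * est_cum[i] + k[i] - q[i], q[i] ~ Exp(1)."""
    scores = [eta * est_cum[i] + k[i] - rng.expovariate(1.0)
              for i in range(len(k))]
    return min(range(len(k)), key=scores.__getitem__)

def run_foe(loss_fn, n, T, gamma=lambda t: t ** -0.25,
            eta=lambda t: t ** -0.75, seed=0):
    """Follow or Explore for n experts over T rounds.

    loss_fn(t, i) returns the true loss of expert i in round t; only the
    loss of the played expert is ever queried (bandit feedback).
    """
    rng = random.Random(seed)
    k = [math.log(n)] * n            # uniform prior: k^i = ln n
    est_cum = [0.0] * n              # estimated cumulative losses
    total, plays = 0.0, [0] * n
    for t in range(1, T + 1):
        g = gamma(t)
        if rng.random() >= g:        # r_t = 0: follow the perturbed leader
            i = fpl_choice(eta(t), est_cum, k, rng)
            total += loss_fn(t, i)   # estimates stay zero in this round
        else:                        # r_t = 1: explore uniformly
            i = rng.randrange(n)
            loss = loss_fn(t, i)
            total += loss
            est_cum[i] += loss * n / g   # unbiased estimate
        plays[i] += 1
    return total, plays

# Toy oblivious environment with B_t = 1: expert 0 is better.
total, plays = run_foe(lambda t, i: (0.2, 0.8)[i], n=2, T=5000)
```

In this toy run the master should end up playing expert 0 in the large majority of rounds, with the remaining plays of expert 1 coming mostly from exploration.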

3. Analysis for Uniform Prior 

In this section we assume a uniform prior $w^i = 1/n$ over
finitely many experts. (The general case is treated in
the next section.) We assume that $B_t > 0$ is some
sequence of upper bounds on the true losses, $\gamma_t \in (0,1)$
is a sequence of exploration rates, and $\eta_t > 0$ is a
decreasing sequence of learning rates.

The analysis is according to the following diagram:

    $L^{FoE} \lesssim E L^{FoE} \lesssim E L^{FPL} \lesssim E\hat L^{FPL} \lesssim E\hat L^{IFPL} \lesssim \hat L^{best} \lesssim L^{best}$   (1)

The symbol $L$ is used informally for the cumulative
loss $\ell_{1:T}$. Each "$\lesssim$" means that we bound the quantity
on the left by the quantity on the right plus some
additive terms. The first and the last expressions are
the losses of the FoE algorithm and the best expert,
respectively. The intermediate quantities belong to
different algorithms, namely FoE, FPL, and a third
one called IFPL for "infeasible" FPL [KV03]. IFPL is
the same as FPL except that it has access to an oracle
providing the current estimated loss vector $\hat\ell_t$ (hence
infeasible). Then it assigns scores of $\eta_t \hat\ell^i_{1:t} + k^i - q_t^i$
instead of $\eta_t \hat\ell^i_{<t} + k^i - q_t^i$. We assume that IFPL uses the
same randomization as FPL (i.e. the respective $q_t$ are
the same).

The randomization of FoE and FPL gives rise to
two filtrations of $\sigma$-algebras. By $\mathcal{A}_t$ for $t \ge 0$ we
denote the $\sigma$-algebra generated by FoE's randomness
$\{u_{1:t}, r_{1:t}\}$ up to time $t$. We may also write
$\mathcal{A} = \bigcup_{t \ge 0} \mathcal{A}_t$. Similarly, $\mathcal{B}_t$ is the $\sigma$-algebra generated
by FoE's and FPL's randomness up to time $t$ (i.e.
$\mathcal{B}_t = \{u_{1:t}, r_{1:t}, q_{1:t}\}$). Then clearly $\mathcal{A}_t \subset \mathcal{B}_t$ for each $t$.

The arguments below rely on conditional expectations
- the expectations in (1) should also be understood
conditional. In particular we will often need the
conditional expectations with respect to FoE's past
randomness $\mathcal{A}_{t-1}$, abbreviated as

    $E_t[X] := E[X \mid \mathcal{A}_{t-1}]$,

where $X$ is some random variable. Then $E_t[X]$ is an
$\mathcal{A}_{t-1}$-measurable random variable, meaning that its
value is determined for fixed past randomness $\mathcal{A}_{t-1}$.
Note in particular that the estimated loss vectors $\hat\ell_t$
are random vectors which depend on FoE's randomness
$\mathcal{A}_t$ up to time $t$ (only). In this way, FoE's (and
FPL's and IFPL's) actions depend on FoE's past
randomness. Note, however, that they do not depend on
FPL's randomness $q_{1:t}$. Finally, $I_t^{FoE}$ and $\ell_t^{FoE}$ are
measurable with respect to the randomness $u_{\le t}, r_{\le t}, q_t$,
i.e. they depend on $u_{\le t}, r_{\le t}, q_t$, but are
independent of $q_{<t}$.

We now start proving the diagram (1). It is helpful
to consider each intermediate algorithm as a stand- 
alone procedure which is actually executed (with an 
oracle if necessary) and has the asserted performance 
guarantees (e.g. in terms of expected losses). 

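Lemma 1 [$L^{FoE} \lesssim E L^{FoE}$] For each $T \ge 1$ and $\delta_T \in (0,1)$,
with probability at least $1 - \delta_T/2$ we have

    $\ell^{FoE}_{1:T} \le \sum_{t=1}^T E_t \ell_t^{FoE} + \sqrt{(2\ln\frac{4}{\delta_T}) \sum_{t=1}^T B_t^2}$.

Proof. The sequence of random variables
$X_T = \sum_{t=1}^T [\ell_t^{FoE} - E_t \ell_t^{FoE}]$ is a martingale with respect
to the filtration $\mathcal{B}_t$ (not $\mathcal{A}_t$!). In order to see this,
observe $E[\ell_T^{FoE} \mid \mathcal{B}_{T-1}] = E(E[\ell_T^{FoE} \mid \mathcal{A}_{T-1}] \mid \mathcal{B}_{T-1})$ and
$E[\ell_t^{FoE} \mid \mathcal{B}_{T-1}] = \ell_t^{FoE}$ for $t < T$, which implies

    $E(X_T \mid \mathcal{B}_{T-1})$
      $= \sum_{t=1}^{T} (E[\ell_t^{FoE} \mid \mathcal{B}_{T-1}] - E[E[\ell_t^{FoE} \mid \mathcal{A}_{t-1}] \mid \mathcal{B}_{T-1}])$
      $= \sum_{t=1}^{T-1} (\ell_t^{FoE} - E[\ell_t^{FoE} \mid \mathcal{A}_{t-1}]) = X_{T-1}$.

Its differences are bounded: $|X_t - X_{t-1}| \le B_t$. Hence,
it follows from Azuma's inequality that the probability
that $X_T$ exceeds some $\lambda > 0$ is bounded by
$p = 2\exp(-\frac{\lambda^2}{2\sum_{t=1}^T B_t^2})$. Requesting $\frac{\delta_T}{2} = p$ and solving
for $\lambda$ gives the assertion. □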
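The last step of this proof, solving the Azuma tail for the deviation $\lambda$, can be verified numerically. The sketch below assumes the tail bound $2\exp(-\lambda^2/(2\sum_t B_t^2))$ as read from the proof; the function names and the example bounds are illustrative.

```python
import math

def azuma_tail(lam, bounds):
    """Tail bound 2 * exp(-lam^2 / (2 * sum_t B_t^2)) from the proof."""
    s = sum(b * b for b in bounds)
    return 2.0 * math.exp(-lam * lam / (2.0 * s))

def azuma_deviation(bounds, delta):
    """Deviation lam with azuma_tail(lam, bounds) == delta / 2, i.e.
    lam = sqrt(2 * ln(4 / delta) * sum_t B_t^2)."""
    s = sum(b * b for b in bounds)
    return math.sqrt(2.0 * math.log(4.0 / delta) * s)

# Example: growing loss bounds B_t = t^{1/8} over 1000 rounds.
bounds = [t ** 0.125 for t in range(1, 1001)]
lam = azuma_deviation(bounds, delta=0.1)
tail = azuma_tail(lam, bounds)
```

Plugging `lam` back into the tail recovers $\delta_T/2$, which is where the factor $\ln(4/\delta_T)$ in the lemma comes from.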

The relation $E L^{FoE} \lesssim E L^{FPL}$ follows immediately from
the specification of the algorithm FoE.



Lemma 2 [$E L^{FoE} \lesssim E L^{FPL}$] For each $t \ge 1$, we have
$E_t \ell_t^{FoE} \le (1-\gamma_t) E_t \ell_t^{FPL} + \gamma_t B_t$.



The next lemma relating $E L^{FPL}$ and $E\hat L^{FPL}$ is technical
but intuitively clear. It states that in (conditional)
expectation, the real loss suffered by FPL is the same
as the estimated loss. This is simply because the loss
estimate is unbiased. A combination with the previous
lemma was shown in [MB04].



Lemma 3 [$E L^{FPL} \lesssim E\hat L^{FPL}$] For each $t \ge 1$, we have
$E_t \ell_t^{FPL} = E_t \hat\ell_t^{FPL}$.

Note that $\hat\ell_t^{FPL}$ is the loss $\hat\ell_t^I$ estimated by FoE, but for
the expert $I = I_t^{FPL}$ chosen by FPL.

Proof. Let $f_t^i = f_t^i(\mathcal{A}_{t-1}) = P[I_t^{FPL} = i \mid \mathcal{A}_{t-1}]$ be the
probability distribution over actions $i$ which FPL uses
at time $t$, depending on the past randomness $\mathcal{A}_{t-1}$.
Let $u_t = [1 \ldots 1]/n$ be the uniform distribution at time $t$
(for non-uniform weights this will be replaced
appropriately later). Then

    $E_t \hat\ell_t^{FPL} = \gamma_t \sum_{i=1}^n f_t^i [(1 - u_t^i) \cdot 0 + u_t^i \hat\ell_t^i|_{r_t=1 \wedge I_t^{FoE}=i}]$
                  $= \sum_{i=1}^n f_t^i \ell_t^i = E_t[\ell_t^{FPL}]$,

where $\hat\ell_t^i|_{r_t=1 \wedge I_t^{FoE}=i} = \ell_t^i/(u_t^i \gamma_t)$ is the estimated loss
under the condition that FoE decided to explore ($r_t = 1$)
and chose action $I_t^{FoE} = i$. □



The following lemma from [KV03] relates the losses of
FPL and IFPL. We repeat the proof, since it is the
crucial and only step in the analysis where we have
to be careful with the upper loss bound $B_t$. Let
$\hat B_t = B_t(n/\gamma_t)$ denote the upper bound on the instantaneous
estimated losses.



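Lemma 4 [$E\hat L^{FPL} \lesssim E\hat L^{IFPL}$] $E_t \hat\ell_t^{FPL} \le E_t \hat\ell_t^{IFPL} + \gamma_t \eta_t \hat B_t^2$
holds for all $t \ge 1$.

Proof. If $r_t = 0$, $\hat\ell_t = 0$ and thus $\hat\ell_t^{FPL} = \hat\ell_t^{IFPL}$ holds.
This happens with probability $1 - \gamma_t$. Otherwise we
have

    $E_t \hat\ell_t^{FPL} = \sum_{i=1}^n \int 1_{I_t^{FPL}=i} \hat\ell_t^i \, d\mu(x)$,   (2)

where $\mu$ denotes the (exponential) distribution of the
perturbations, i.e. $x_i := q_t^i$ and density $\mu(x) := e^{-\|x\|_1}$.
The idea is now that if action $i$ was selected by FPL,
it is - because of the exponentially distributed
perturbation - with high probability also selected by IFPL.
Formally, we write $u^+ = \max(u,0)$ for $u \in \mathbb{R}$, abbreviate
$\lambda = \hat\ell_{<t} + k/\eta_t$, and denote by $\int \ldots d\mu(x_{\ne i})$ the
integration leaving out the $i$th action. Then, using
$\eta_t \lambda_i - x_i \le \eta_t \lambda_j - x_j$ for all $j$ if $I_t^{FPL} = i$ in the first line,
and $\hat B_t \ge \hat\ell_t^i - \hat\ell_t^j$ in the fourth line, we get

    $\int 1_{I_t^{FPL}=i} \hat\ell_t^i \, d\mu(x)$
      $= \int \int_{x_i \ge \max_{j \ne i}\{\eta_t(\lambda_i - \lambda_j) + x_j\}} \hat\ell_t^i \, d\mu(x_i) \, d\mu(x_{\ne i})$
      $= \int \hat\ell_t^i \, e^{-(\max_{j \ne i}\{\eta_t(\lambda_i - \lambda_j) + x_j\})^+} d\mu(x_{\ne i})$
      $\le \int \hat\ell_t^i \, e^{\eta_t \hat B_t} e^{-(\max_{j \ne i}\{\eta_t(\lambda_i - \lambda_j) + x_j\} + \eta_t \hat B_t)^+} d\mu(x_{\ne i})$
      $\le e^{\eta_t \hat B_t} \int \hat\ell_t^i \, e^{-(\max_{j \ne i}\{\eta_t(\lambda_i + \hat\ell_t^i - \lambda_j - \hat\ell_t^j) + x_j\})^+} d\mu(x_{\ne i})$
      $= e^{\eta_t \hat B_t} \int 1_{I_t^{IFPL}=i} \hat\ell_t^i \, d\mu(x)$.

Summing over $i$ and using the analogue of (2) for IFPL,
we see that if $r_t = 1$, then $E_t \hat\ell_t^{FPL} \le e^{\eta_t \hat B_t} E_t \hat\ell_t^{IFPL}$ holds.
Thus $E_t \hat\ell_t^{IFPL} \ge e^{-\eta_t \hat B_t} E_t \hat\ell_t^{FPL} \ge (1 - \eta_t \hat B_t) E_t \hat\ell_t^{FPL} \ge
E_t \hat\ell_t^{FPL} - \eta_t \hat B_t^2$. The assertion now follows by taking
expectations w.r.t. $r_t$. □

The next lemma relates the losses of IFPL and the best
action in hindsight. For an oblivious adversary (which
means that the adversary's decisions do not depend on
our past actions), the proof was given in [KV03]. An
additional step is necessary for an adaptive adversary.
We omit the proof here, the reader may reconstruct it
from the proof of Lemma 9.

Lemma 5 [$E\hat L^{IFPL} \lesssim \hat L^{best}$] Assume decreasing learning
rate $\eta_t$ and $\sum_i e^{-k^i} \le 1$. For all $T \ge 1$ and $1 \le i \le n$,
we have $\sum_{t=1}^T E_t \hat\ell_t^{IFPL} \le \hat\ell^i_{1:T} + \frac{k^i}{\eta_T}$ (recall that $\hat\ell^i_{1:T}$ is a
random variable depending on $\mathcal{A}_T$).

Finally, we give a relation between the estimated and
true losses, adapted from [MB04].

Lemma 6 [$\hat L^{best} \lesssim L^{best}$] For each $T \ge 1$, $\delta_T \in (0,1)$, and
$1 \le i \le n$, w.p. at least $1 - \delta_T/2$ we have

    $\hat\ell^i_{1:T} \le \ell^i_{1:T} + \sqrt{(2\ln\frac{4}{\delta_T}) \sum_{t=1}^T \hat B_t^2}$.   (3)

Proof. $X_t = \hat\ell^i_{1:t} - \ell^i_{1:t}$ is a martingale, since

    $E[X_t \mid \mathcal{A}_{t-1}] = E[\hat\ell^i_{1:t} \mid \mathcal{A}_{t-1}] - \ell^i_{1:t}$
                   $= X_{t-1} + E[\hat\ell_t^i \mid \mathcal{A}_{t-1}] - \ell_t^i = X_{t-1}$.

Its differences are bounded: $|X_t - X_{t-1}| \le \hat B_t$. By
Azuma's inequality, its actual value at time $T$ does
not exceed $\sqrt{(2\ln\frac{4}{\delta_T}) \sum_{t=1}^T \hat B_t^2}$ w.p. $1 - \delta_T/2$. □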
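The key step in the proof of Lemma 4 is that, thanks to the exponential perturbations, the probability that FPL selects an action is at most $e^{\eta_t \hat B_t}$ times the probability that IFPL selects it. For two experts this can be checked in closed form, because the difference of two independent Exp(1) variables is Laplace distributed. The following sketch is an illustration under that two-expert assumption; the function names and grid values are hypothetical.

```python
import math

def laplace_survival(d):
    """P[q0 - q1 >= d] for independent q0, q1 ~ Exp(1)."""
    return 0.5 * math.exp(-d) if d >= 0 else 1.0 - 0.5 * math.exp(d)

def selection_probabilities(eta, past, current, k):
    """Probability that expert 0 (of two) is selected by FPL and by IFPL.

    past    -- estimated cumulative losses before the current round
    current -- estimated losses of the current round (seen by IFPL only)
    k       -- complexities of the two experts
    """
    # Expert 0 wins iff eta*s0 + k0 - q0 <= eta*s1 + k1 - q1,
    # i.e. iff q0 - q1 >= eta*(s0 - s1) + k0 - k1.
    d_fpl = eta * (past[0] - past[1]) + k[0] - k[1]
    d_ifpl = d_fpl + eta * (current[0] - current[1])
    return laplace_survival(d_fpl), laplace_survival(d_ifpl)

def ratio_bound_holds(eta, b_hat, steps=5):
    """Check P_FPL <= exp(eta * b_hat) * P_IFPL on a small grid of
    current losses in [0, b_hat] and of past loss gaps."""
    grid = [b_hat * j / steps for j in range(steps + 1)]
    for gap in (-3.0, -0.5, 0.0, 0.5, 3.0):
        for c0 in grid:
            for c1 in grid:
                p_fpl, p_ifpl = selection_probabilities(
                    eta, (gap, 0.0), (c0, c1), (0.0, 0.0))
                if p_fpl > math.exp(eta * b_hat) * p_ifpl + 1e-12:
                    return False
    return True
```

The grid check is of course no proof, but it makes the role of the bound $\hat B_t$ on a single estimated loss visible: the larger $\eta_t \hat B_t$, the weaker the coupling between FPL and IFPL.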



We now combine the above results and derive an up- 
per bound on the expected regret of FoE against an 
adaptive adversary. 



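Theorem 7 [FoE against an adaptive adversary] Let
$n$ be finite and $k^i = \ln n$ for all $1 \le i \le n$. Let $\eta_t$ be
decreasing, and $\ell_t \in [0,B_t]^n$ some possibly adaptive
assignment of loss vectors. Then for all experts $i$,

    $\ell^{FoE}_{1:T} \le \ell^i_{1:T} + \sqrt{2\ln\frac{4}{\delta_T}} \left( \sqrt{\sum_{t=1}^T B_t^2} + \sqrt{\sum_{t=1}^T \frac{n^2 B_t^2}{\gamma_t^2}} \right)$
        $+ \frac{\ln n}{\eta_T} + \sum_{t=1}^T \frac{n^2 B_t^2 \eta_t}{\gamma_t} + \sum_{t=1}^T \gamma_t B_t$   w.p. $1 - \delta_T$, and

    $E \ell^{FoE}_{1:T} \le E \ell^i_{1:T} + \frac{\ln n}{\eta_T} + \sum_{t=1}^T \frac{n^2 B_t^2 \eta_t}{\gamma_t} + \sum_{t=1}^T \gamma_t B_t$
        $+ \sqrt{(2\ln\frac{4}{\delta_T}) \sum_{t=1}^T \frac{n^2 B_t^2}{\gamma_t^2}} + \frac{\delta_T}{2} \sum_{t=1}^T \frac{n B_t}{\gamma_t}$.

Proof. The first high probability bound follows by
summing up all excess terms in the above lemmas,
observing that $\hat B_t = B_t(n/\gamma_t)$. For the second bound on
the expectation, we take expectations in Lemmas 2-5,
while Lemma 1 is not used. For Lemma 6, a statement
in expectation is obtained as follows: (3) fails w.p. at
most $\delta_T/2$, in which case $\hat\ell^i_{1:T} - \ell^i_{1:T} \le \sum_{t=1}^T \hat B_t$. □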
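To get a feeling for the size of the high probability bound, one can sum the excess terms of Lemmas 1, 2, 4, 5 and 6 numerically. The sketch below assumes those terms as read here (with $\hat B_t = B_t n/\gamma_t$ and $\delta_T = T^{-2}$) and the rates $\gamma_t = t^{-1/4}$, $\eta_t = t^{-3/4}$ used in the corollaries; the function name and defaults are illustrative and not part of the paper.

```python
import math

def foe_regret_bound(T, n, B=lambda t: 1.0,
                     gamma=lambda t: t ** -0.25,
                     eta=lambda t: t ** -0.75):
    """Sum of the excess terms of Lemmas 1, 2, 4, 5, 6 with delta = T^-2."""
    delta = T ** -2.0
    sum_b2 = sum_bhat2 = sum_lemma4 = sum_explore = 0.0
    for t in range(1, T + 1):
        b, g, e = B(t), gamma(t), eta(t)
        b_hat = b * n / g                 # bound on one estimated loss
        sum_b2 += b * b                   # Lemma 1
        sum_bhat2 += b_hat * b_hat        # Lemma 6
        sum_lemma4 += g * e * b_hat * b_hat
        sum_explore += g * b              # Lemma 2
    deviation = math.sqrt(2.0 * math.log(4.0 / delta)) * (
        math.sqrt(sum_b2) + math.sqrt(sum_bhat2))
    return deviation + math.log(n) / eta(T) + sum_lemma4 + sum_explore

# Per-round regret bound for two experts and bounded losses.
short_run = foe_regret_bound(10 ** 2, 2) / 10 ** 2
long_run = foe_regret_bound(10 ** 4, 2) / 10 ** 4
```

The per-round value decreases with $T$, but slowly: the constants are large, so the bound only becomes informative for long runs.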



Corollary 8 Under the conditions of Theorem 7,

    (i)   $B_t \equiv 1$        $\Rightarrow$  $E\ell^{FoE}_{1:T} \le E\ell^i_{1:T} + O(n^2 T^{3/4} \sqrt{\ln T})$,
    (ii)  $B_t \equiv 1$        $\Rightarrow$  $\ell^{FoE}_{1:T} \le \ell^i_{1:T} + O(n^2 T^{3/4} \sqrt{\ln T})$,
    (iii) $B_t = t^{1/8}$  $\Rightarrow$  $E\ell^{FoE}_{1:T} \le E\ell^i_{1:T} + O(n^2 T^{7/8} \sqrt{\ln T})$,
    (iv)  $B_t = t^{1/8}$  $\Rightarrow$  $\ell^{FoE}_{1:T} \le \ell^i_{1:T} + O(n^2 T^{7/8} \sqrt{\ln T})$,

for all $i$ and $T$. Here, (ii) and (iv) hold with probability
$1 - T^{-2}$. Moreover, in both cases (bounded and growing
$B_t$) FoE is asymptotically optimal, i.e.

    $\limsup_{T \to \infty} \frac{1}{T} \left( \ell^{FoE}_{1:T} - \min_i \ell^i_{1:T} \right) \le 0$ almost surely.

$B_t = t^{1/8}$ in (iii) and (iv) is just one choice to achieve
asymptotic optimality while the losses may grow
unboundedly. Asymptotic optimality is sometimes
termed Hannan-consistency, in particular if the limit
equals zero. We only show the upper bound.

Proof. (i) and (ii) follow by applying the previous
theorem to $\eta_t = t^{-3/4}$, $\gamma_t = t^{-1/4}$, $\delta_T = T^{-2}$, and observing
$\sum_{t=1}^T t^\alpha \le \int_0^{T+1} t^\alpha \, dt \le 2(T+1)^{1+\alpha}$ for $\alpha \ge -\frac{1}{2}$. In order
to obtain (iii) and (iv), set $\eta_t = t^{-3/4}$, $\gamma_t = t^{-1/4}$, and
$\delta_T = T^{-2}$. The asymptotic optimality finally follows
from the Borel-Cantelli Lemma, since

    $P\left[ \frac{1}{T} \left( \ell^{FoE}_{1:T} - \ell^i_{1:T} \right) > C \, T^{-1/8} \sqrt{\ln T} \right] \le \frac{1}{T^2}$

for an appropriate $C > 0$ according to (ii) and (iv). □

4. Infinitely Many Experts and
Arbitrary Priors

The following considerations are valid for both finitely
and infinitely many experts with arbitrary prior
weights $w^i$. For notational convenience, we write
$n = \infty$ in the latter case. When admitting infinitely
many experts, two difficulties arise: Since the prior
weights of the experts sum up to one and thus become
arbitrarily small, the estimated losses - obtained by
dividing by these weights - would possibly get
arbitrarily large. We therefore introduce, for each expert
$i$, a time $\tau^i \ge 1$ at which the expert enters the game. All
algorithms FoE, FPL, IFPL are substituted by
counterparts FoE$^\tau$, FPL$^\tau$, IFPL$^\tau$ which use expert $i$ only
for $t \ge \tau^i$. Thus, the maximum estimated loss possibly
assigned to these active experts is

    $\hat B_t = B_t / [\gamma_t \min\{w^i : t \ge \tau^i\}]$.   (4)

We denote the set of active experts at time $t$ by
$\{t \ge \tau\} = \{i : t \ge \tau^i\}$. Experts which have not yet entered
the game are given an estimated loss of $\hat B_t$. This also
solves the computability problem: Since at every time
$t$ only a finite number of experts is involved, FoE$^\tau$ is
computable (if each expert is). The algorithms FoE$^\tau$
and FPL$^\tau$ are specified in Figures 3 and 4.

For $t = 1,2,3,\ldots$
    Sample $r_t \in \{0,1\}$ independently s.t. $P[r_t = 1] = \gamma_t$
    If $r_t = 0$ Then
        Invoke FPL$^\tau$($t$) and play its decision
        Set $\hat\ell_t^i = 0$ for $i \in \{t \ge \tau\}$
    Else
        Sample $I_t^{FoE}$ w.r.t. $u_t$ in (5) and play $I := I_t^{FoE}$
        Set $\hat\ell_t^I = \ell_t^I/(u_t^I \gamma_t)$ and $\hat\ell_t^i = 0$ for $i \in \{t \ge \tau\} \setminus \{I\}$
    Set $\hat\ell_t^i = \hat B_t$ for $i \notin \{t \ge \tau\}$

Figure 3. The algorithm FoE$^\tau$

Sample $q_t^i \sim \mathrm{Exp}$ independently for $i \in \{t \ge \tau\}$
select and play $I_t^{FPL} = \arg\min_{i : t \ge \tau^i} \{\eta_t \hat\ell^i_{<t} + k^i - q_t^i\}$

Figure 4. The algorithm FPL$^\tau$($t$)

Again, the analysis follows the outline (1). Lemmas 1-4
have equivalent counterparts, the proofs of which
remain almost unchanged. In Lemma 3, the "uniform"
distribution over experts $u_t$ now becomes

    $u_t^i = \frac{w^i 1_{t \ge \tau^i}}{\sum_j w^j 1_{t \ge \tau^j}}$.   (5)

The upper bound on the estimated loss $\hat B_t$ in Lemma 4
is given by (4). We only need to prove assertions
corresponding to Lemmas 5 and 6.
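The effect of the entry times on the bound (4) can be illustrated with a short Python sketch. It assumes entry times of the form $\tau^i = \lceil (w^i)^{-\alpha} \rceil$ with $\alpha = 8$, as chosen later in the paper for bounded losses; the function names and the three-expert prior are hypothetical, and powers of two are used so that the floating point arithmetic is exact.

```python
import math

def entry_time(w, alpha=8):
    """Entry time tau^i = ceil((w^i)^(-alpha)) of an expert with weight w."""
    return math.ceil(w ** -alpha)

def active_experts(weights, t, alpha=8):
    """Indices i with t >= tau^i, i.e. the set {t >= tau}."""
    return [i for i, w in enumerate(weights) if t >= entry_time(w, alpha)]

def estimated_loss_bound(weights, t, B_t, gamma_t, alpha=8):
    """B_hat_t = B_t / (gamma_t * min{w^i : t >= tau^i}), cf. (4).

    Returns None while no expert has entered the game yet."""
    active = active_experts(weights, t, alpha)
    if not active:
        return None
    w_min = min(weights[i] for i in active)
    return B_t / (gamma_t * w_min)

weights = [0.5, 0.25, 0.125]          # entry times 2^8, 2^16, 2^24
early = estimated_loss_bound(weights, 1000, B_t=1.0, gamma_t=0.5)
later = estimated_loss_bound(weights, 70000, B_t=1.0, gamma_t=0.5)
```

Delaying low-weight experts keeps $\hat B_t$ moderate: it grows only when a new expert becomes active, and at most polynomially in $t$.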



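Lemma 9 [$E\hat L^{IFPL^\tau} \lesssim \hat L^{best}$] Assume that $\sum_i e^{-k^i} \le 1$
and $\tau^i$ depends monotonically on $k^i$, i.e. $\tau^i \ge \tau^j$ if and
only if $k^i \ge k^j$. Assume decreasing learning rate $\eta_t$.
For all $T \ge 1$ and all $1 \le i \le n$, we have

    $\sum_{t=1}^T E_t \hat\ell_t^{IFPL^\tau} \le \hat\ell^i_{1:T} + \frac{k^i + 1}{\eta_T}$.

Proof. This is a modification of the corresponding
proofs in [KV03] and [HP04b]. We may fix the
randomization $\mathcal{A}$ and suppress it in the notation. Then
we only need to show

    $E \hat\ell^{IFPL^\tau}_{1:T} \le \min_{1 \le i \le n} \{ \hat\ell^i_{1:T} + \frac{k^i + 1}{\eta_T} \}$,   (6)

where the expectation is with respect to IFPL$^\tau$'s
randomness $q_{1:T}$.

Assume first that the adversary is oblivious. We define
an algorithm $A$ as a variant of IFPL$^\tau$ which samples
only one perturbation vector $q$ in the beginning and
uses this in each time step, i.e. $q_t \equiv q$. Since the
adversary is oblivious, $A$ is equivalent to IFPL$^\tau$ in terms
of expected performance. This is all we need to show
(6). Let $\eta_0 = \infty$ and $\lambda_t = \hat\ell_t + (k - q)(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}})$, then
$\lambda_{1:t} = \hat\ell_{1:t} + \frac{k - q}{\eta_t}$. Recall $\{t \ge \tau\} = \{i : t \ge \tau^i\}$. We argue
by induction that for all $T \ge 1$,

    $\sum_{t=1}^T \lambda_t^A \le \min_{T \ge \tau} \lambda^i_{1:T} + \max_{T \ge \tau} \{ \frac{q^i - k^i}{\eta_T} \}$,   (7)

where $\lambda_t^A$ denotes the component of $\lambda_t$ belonging to the
expert $I_t^A$ selected by $A$.
This clearly holds for $T = 0$. For the induction step,
we have to show

    $\min_{T \ge \tau} \lambda^i_{1:T} + \max_{T \ge \tau} \{ \frac{q^i - k^i}{\eta_T} \} + \lambda^A_{T+1}$
      $\le \lambda^{I^A_{T+1}}_{1:T} + \max_{T+1 \ge \tau} \{ \frac{q^i - k^i}{\eta_{T+1}} \} + \lambda^A_{T+1}$   (8)
      $= \min_{T+1 \ge \tau} \lambda^i_{1:T+1} + \max_{T+1 \ge \tau} \{ \frac{q^i - k^i}{\eta_{T+1}} \}$.

The inequality is obvious if $I^A_{T+1} \in \{T \ge \tau\}$. Otherwise,
let $J = \arg\max \{ q^i - k^i : i \in \{T \ge \tau\} \}$. Then

    $\min_{T \ge \tau} \lambda^i_{1:T} + \max_{T \ge \tau} \{ \frac{q^i - k^i}{\eta_T} \} \le \lambda^J_{1:T} + \frac{q^J - k^J}{\eta_T} = \hat\ell^J_{1:T}$
      $\le \sum_{t=1}^T \hat B_t = \sum_{t=1}^T \hat\ell_t^{I^A_{T+1}} \le \lambda^{I^A_{T+1}}_{1:T} + \max_{T+1 \ge \tau} \{ \frac{q^i - k^i}{\eta_T} \}$

shows (8). Rearranging terms in (7), we see

    $\sum_{t=1}^T \hat\ell_t^A \le \min_{T \ge \tau} \lambda^i_{1:T} + \max_{T \ge \tau} \{ \frac{q^i - k^i}{\eta_T} \} + \sum_{t=1}^T (q - k)^{I_t^A} (\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}})$.

The assertion (6) - still for oblivious adversary and
$q_t \equiv q$ - then follows by taking expectations and using

    $E \min_{T \ge \tau} \lambda^i_{1:T} \le \min_{T \ge \tau} \{ \hat\ell^i_{1:T} + \frac{k^i - E q^i}{\eta_T} \} \overset{(*)}{\le} \min_{1 \le i \le n} \{ \hat\ell^i_{1:T} + \frac{k^i - 1}{\eta_T} \}$

    and $E \sum_{t=1}^T (q - k)^{I_t^A} (\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}) \le E \max_i \{ \frac{q^i - k^i}{\eta_T} \} \le \frac{1}{\eta_T}$.

Here, (*) holds because $\tau^i$ depends monotonically on
$k^i$, and $E q^i = 1$, and maximality of $\hat\ell^i_{1:T}$ for $T < \tau^i$. The
last inequality can be proven by an application of the
union bound [HP04b, Lem.1].

Sampling the perturbations $q_t$ independently is
equivalent under expectation to sampling $q$ only once. So
assume that $q_t$ are sampled independently, i.e. that
IFPL$^\tau$ is played against an oblivious adversary: (6)
remains valid. In the last step, we argue that then (6)
also holds for an adaptive adversary. This is true
because the future actions of IFPL$^\tau$ do not depend on its
past actions, and therefore the adversary cannot gain
from deciding after having seen IFPL$^\tau$'s decisions. (For
details see [HP04a]. Note the subtlety that the future
actions of FoE$^\tau$ would depend on its past actions.) □

Lemma 10 [$\hat L^{best} \lesssim L^{best}$] For each $T \ge 1$, $\delta_T \in (0,1)$,
and $1 \le i \le n$, we have

    $\hat\ell^i_{1:T} \le \ell^i_{1:T} + \sqrt{(2\ln\frac{4}{\delta_T}) \sum_{t=1}^T \hat B_t^2} + \sum_{t=1}^{\tau^i - 1} \hat B_t$   w.p. $1 - \delta_T/2$.

This corresponds to Lemma 6. The proof proceeds in a
similar way: we have to note that $\hat\ell^i_{1:t} - \ell^i_{1:t}$ is a
martingale only for $t \ge \tau^i$, and $\hat\ell^i_{<\tau^i}$ exceeds $\ell^i_{<\tau^i}$ by at most
$\sum_{t=1}^{\tau^i - 1} \hat B_t$. Then the following theorem corresponds to
Theorem 7 and is proven likewise.

Theorem 11 [FoE$^\tau$ against an adaptive adversary]
Let $n$ be finite or infinite, $\sum_i e^{-k^i} \le 1$, $\tau^i$ depend
monotonically on $k^i$, and the learning rate $\eta_t$ be decreasing.
Let $\ell_t$ be some possibly adaptive assignment of (true) loss
vectors satisfying $\|\ell_t\|_\infty \le B_t$. Then for all experts $i$,
we have

    $\ell^{FoE^\tau}_{1:T} \le \ell^i_{1:T} + \sqrt{2\ln\frac{4}{\delta_T}} \left( \sqrt{\sum_{t=1}^T B_t^2} + \sqrt{\sum_{t=1}^T \frac{B_t^2}{\gamma_t^2 w_t^2}} \right) + \frac{k^i + 1}{\eta_T}$
        $+ \sum_{t=1}^T \frac{\eta_t B_t^2}{\gamma_t w_t^2} + \sum_{t=1}^T \gamma_t B_t + \sum_{t=1}^{\tau^i - 1} \frac{B_t}{\gamma_t w_t}$

with probability $1 - \delta_T$, where $w_t = \min\{w^i : t \ge \tau^i\}$.
A corresponding statement holds for the expectation
(compare Theorem 7).

Corollary 12 Assume the conditions of Theorem 11.
Then for all $i$ and $T$, the following holds w.p. $1 - \delta_T$.

    (i)   $B_t \equiv 1$, $\tau^i = \lceil (1/w^i)^{8} \rceil$  $\Rightarrow$
          $\ell^{FoE^\tau}_{1:T} \le \ell^i_{1:T} + O((\frac{1}{w^i})^{11} + T^{7/8} \sqrt{\ln T})$, and
    (iii) $B_t = t^{1/16}$, $\tau^i = \lceil (1/w^i)^{16} \rceil$  $\Rightarrow$
          $\ell^{FoE^\tau}_{1:T} \le \ell^i_{1:T} + O((\frac{1}{w^i})^{22} + T^{7/8} \sqrt{\ln T})$.

Corresponding assertions are true for the expectation
(compare Corollary 8). In both cases (bounded and
growing $B_t$) FoE$^\tau$ is asymptotically optimal w.r.t. each
expert: $\limsup_{T \to \infty} \frac{1}{T} (\ell^{FoE^\tau}_{1:T} - \ell^i_{1:T}) \le 0$ a.s. for all $i$.

Proof. Let $\eta_t = t^{-3/4}$, $\gamma_t = t^{-1/4}$, and $\delta_T = T^{-2}$. For
$\tau^i = \lceil (w^i)^{-\alpha} \rceil$ and $B_t = t^\beta$, we have
$w_T = \min\{w^i : T \ge \lceil (w^i)^{-\alpha} \rceil\} \ge \min\{w^i : T^{-1/\alpha} \le w^i\} \ge T^{-1/\alpha}$ and

    $\sum_{t=1}^{\tau^i - 1} \hat B_t \le (\tau^i - 1) \hat B_{\tau^i - 1} \le \frac{(\tau^i - 1)^{\frac{5}{4} + \beta}}{w^i} \le (w^i)^{-\alpha(\frac{5}{4} + \beta) - 1}$

(observe $w_{\tau^i - 1} \ge (\tau^i - 1)^{-1/\alpha} \ge w^i$). Then set
$\alpha = 8$, $\beta = 0$ for (i) and $\alpha = 16$, $\beta = \frac{1}{16}$ for (iii).
Asymptotic optimality is shown as in Corollary 8. □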

5. Active Expert Problems and a 
Universal Master Algorithm 

If the adversary's goal is just to maximize our (expected)
regret, then it is well known what he can
achieve (at least for uniform prior, see e.g. the lower
bound in [CB97, ACBFS02]). We are interested in
different situations. An example is the repeated playing
of the "Prisoner's dilemma" against the Tit-for-Tat$^1$
strategy [FM04]. If we use two strategies as experts,
namely "always cooperate" and "always defect", then
it is clear that always cooperating will have the better
long-term reward. It is also clear that a standard
expert advice or bandit master algorithm will not
discover this, since it compares only the losses in one step,
which are always lower for the defecting expert.
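The effect can be reproduced with a small simulation. The numerical loss values below are assumptions chosen only to respect the ordering described in the footnote (no loss for defecting against a cooperator, a small loss for mutual cooperation, a large loss for mutual defection, a very large loss for cooperating against a defector); they are not taken from the paper.

```python
# Hypothetical losses (my move, opponent's move) -> my loss
LOSS = {("C", "C"): 0.3, ("D", "D"): 0.7, ("C", "D"): 1.0, ("D", "C"): 0.0}

def play_against_tit_for_tat(strategy, rounds):
    """Cumulative loss of `strategy` (a function of the round index)
    against Tit-for-Tat, which starts with C and then repeats our
    preceding move."""
    opponent_move = "C"
    total = 0.0
    for t in range(rounds):
        my_move = strategy(t)
        total += LOSS[(my_move, opponent_move)]
        opponent_move = my_move
    return total

always_cooperate = lambda t: "C"
always_defect = lambda t: "D"

coop_loss = play_against_tit_for_tat(always_cooperate, 100)
defect_loss = play_against_tit_for_tat(always_defect, 100)
```

Whatever the opponent's current move, defecting has the lower one-step loss, yet over 100 rounds "always cooperate" accumulates far less loss than "always defect". An expert therefore has to be followed for several consecutive rounds before its long-term quality shows up in the observed losses.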

We therefore propose to give the control to a selected
expert for periods of increasing length. Precisely, we
introduce a new time scale $\tilde t$ at which we have single
games with losses $\tilde\ell_{\tilde t}$. The master's time scale $t$ does
not coincide with $\tilde t$. Instead, at each $t$, the master gives
control to the selected expert $i$ for $T_t$ single games and
receives loss $\ell_t^i = \sum_{\tilde t = \tilde t(t)}^{\tilde t(t) + T_t - 1} \tilde\ell^i_{\tilde t}$. Assume that the game
has bounded instantaneous losses $\tilde\ell^i_{\tilde t} \in [0,1]$. Then the
master algorithm's instantaneous losses are bounded
by $T_t$. We denote this algorithm by $\widetilde{\mathrm{FoE}}$ or $\widetilde{\mathrm{FoE}}^\tau$.

$^1$ In the prisoner's dilemma, two players both decide
independently if they are cooperating (C) or defecting (D). If
both play C, they both get a small loss, if both play D,
they get a large loss. However, if one plays C and one D,
the cooperating player gets a very large loss and the
defecting player no loss at all. Thus defecting is a dominant
strategy. A Tit-for-Tat player plays C in the first move and
afterwards the opponent's respective preceding move.

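Corollary 13 Assume $\widetilde{\mathrm{FoE}}$ (or $\widetilde{\mathrm{FoE}}^\tau$, respectively)
plays a repeated game with bounded instantaneous
losses $\tilde\ell^i_{\tilde t} \in [0,1]$. Let the exploration and learning rates
be $\gamma_t = t^{-1/4}$ and $\eta_t = t^{-3/4}$. In case of uniform prior,
choose $T_t = \lfloor t^{1/8} \rfloor$ ($\tau^i = 0$). In case of arbitrary prior let
$T_t = \lfloor t^{1/16} \rfloor$ and $\tau^i = \lceil (w^i)^{-16} \rceil$. Then for all experts $i$
and all $\tilde T$, suppressing the dependence on the prior of
expert $i$, we have

    $\tilde\ell^{\widetilde{FoE}}_{1:\tilde T} \le \tilde\ell^i_{1:\tilde T} + O(\tilde T^{9/10})$ w.p. $1 - \tilde T^{-2}$ and
    $E \tilde\ell^{\widetilde{FoE}}_{1:\tilde T} \le \tilde\ell^i_{1:\tilde T} + O(\tilde T^{9/10})$.

Consequently, $\limsup_{\tilde T \to \infty} (\tilde\ell^{\widetilde{FoE}}_{1:\tilde T} - \tilde\ell^i_{1:\tilde T})/\tilde T \le 0$ almost
surely. The rate of convergence is at least $\tilde T^{-1/10}$. The
same assertions hold for $\widetilde{\mathrm{FoE}}^\tau$.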

Proof. This follows from changing the time scale from
$t$ to $\tilde t$ in Corollaries 8 and 12: $\tilde t$ is of order $t^{1 + \frac{1}{8}}$ in the
uniform case and $t^{1 + \frac{1}{16}}$ in the general case. Then the
bounds are $\tilde T^{7/9} \sqrt{\ln \tilde T}$ in the former and $\tilde T^{14/17} \sqrt{\ln \tilde T}$ in
the latter case. Both are upper bounded by $\tilde T^{9/10}$. □
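The change of time scale can be made concrete as follows. The sketch counts the single games played after $t$ master rounds when round $s$ hands over control for $\lfloor s^{1/8} \rfloor$ games, and converts an exponent of the master's regret bound into an exponent in the single-game time scale. It assumes that the master's bounds are of order $T^{7/8}$ (up to the logarithmic factor) in both cases; the function names are illustrative.

```python
import math
from fractions import Fraction

def single_games_played(t, exponent=1 / 8):
    """Number of single games after t master rounds, when round s
    gives control to the selected expert for floor(s**exponent) games."""
    return sum(math.floor(s ** exponent) for s in range(1, t + 1))

def rescaled_exponent(master_exponent, period_exponent):
    """If the master's regret is of order t^a and the single-game time
    is of order t^(1 + b), the regret is of order T~^(a / (1 + b))."""
    return master_exponent / (1 + period_exponent)

games = single_games_played(10 ** 4)
uniform_case = rescaled_exponent(Fraction(7, 8), Fraction(1, 8))
general_case = rescaled_exponent(Fraction(7, 8), Fraction(1, 16))
```

Both rescaled exponents stay below $9/10$, which leaves room to absorb the factor $\sqrt{\ln \tilde T}$.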

Broadly speaking, this means that $\widetilde{\mathrm{FoE}}$ performs
asymptotically as well as the best expert. Asymptotic
guarantees for the Strategic Experts Algorithm have
been derived by [FM04]. Our results improve upon
this by providing a rate of convergence. One can give
further corollaries, e.g. in terms of flexibility as defined
by [FM04].

It is also possible to specify a universal experts algorithm.
To this aim, let expert $i$ be derived from the
$i$th program $p^i$ of some fixed universal Turing machine.
The $i$th program can be well-defined, e.g. by
representing programs as binary strings and lexicographically
ordering them [Hut04]. Before the expert is consulted,
the relevant input is written to the input tape
of the corresponding program. If the program halts,
the appropriate number of first bits is interpreted as
the expert's recommendation. E.g. if the decision is
binary, then the first bit suffices. (If the program
does not halt, we may for well-definedness just fill its
output tape with zeros.) Each expert is assigned a
prior weight by $w^i = 2^{-\mathrm{length}(p^i)}$, where $\mathrm{length}(p^i)$ is
the length of the corresponding program and we assume
the program tape to be binary. This construction
parallels the definition of Solomonoff's universal
prior [Sol78]. This has been used to define a universal
agent AIXI in a quite different way by [Hut04]. Note
that like the universal prior and AIXI, our universal
agent is not computable, since we cannot check if a
program halts. It is however straightforward to impose a
bound on the computation time which for instance
increases rapidly in $t$. If used with computable experts,
the algorithm is computationally feasible. The universal
master algorithm performs well with respect to any
computable strategy.
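The bookkeeping of such an expert class, i.e. enumerating programs and assigning weights and entry times, can be sketched in Python. This is only an illustration: it enumerates all non-empty binary strings in length-lexicographic order and does not enforce prefix-freeness, which a prior summing to at most one would require; running the programs themselves is not modelled. The exponent 16 in the entry time is the choice used in the corollaries for arbitrary priors.

```python
from itertools import count, islice, product

def programs():
    """Yield binary strings in length-lexicographic order:
    '0', '1', '00', '01', '10', '11', '000', ..."""
    for length in count(1):
        for bits in product("01", repeat=length):
            yield "".join(bits)

def prior_weight(program):
    """Prior weight w^i = 2^(-length(p^i))."""
    return 2.0 ** -len(program)

def entry_time(program, alpha=16):
    """Entry time ceil((w^i)^(-alpha)), which equals 2^(alpha * length)
    exactly for weights of the form 2^(-length)."""
    return 2 ** (alpha * len(program))

first_six = list(islice(programs(), 6))
late = entry_time("01")            # 2^32 master rounds for a 2-bit program
```

Even a two-bit program enters the game only after $2^{32}$ master rounds under this choice, which anticipates the remarks on practical relevance in the discussion.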

Corollary 14 Assume the universal set of experts
specified in the last paragraph. If $\widetilde{\mathrm{FoE}}^\tau$ is applied with
$\gamma_t = t^{-1/4}$, $\eta_t = t^{-3/4}$, $T_t = \lfloor t^{1/16} \rfloor$, and $\tau^i = \lceil (w^i)^{-16} \rceil$,
then it performs asymptotically at least as well as any
computable expert $i$. The rate of convergence is
exponential in the complexity $k^i$ and proportional to $\tilde T^{-1/10}$.

6. Discussion 

For large or infinite expert classes, the bounds we have
proven are irrelevant in practice, although they assert
almost sure optimality and even a convergence rate: the
exponential of the complexity is far too huge. Imagine
for instance a moderately complex task and some
good strategy, which can be coded with a mere 500 bits.
Then its weight is $2^{-500}$, a constant which is not
distinguishable from zero in all practical situations. Thus,
it seems that the bounds can be relevant at most for
small expert classes with uniform prior. This is a general
shortcoming of bandit experts algorithms: For
uniform prior a lower bound on the expected loss which
is linear in $\sqrt{n}$ has been proven [ACBFS02].

If the bounds are not practically relevant, maybe the
algorithms are so? We leave this interesting question
unanswered. Intuitively, it might seem that the
algorithms proposed here are too much tailored towards
worst-case bounds and fully adversarial setups. For
example, the exploration rate of $t^{-1/4}$ is quite high.
Master algorithms which are less "cautious" might perform
better for many practical problems. Finally, it would
be nice to investigate the differences between the
proposed expert style approach and other definitions of
universal agents, such as by [Hut04].